[GPU]qwen3 moe fused compressed #32536

riverlijunjie · 2025-10-24T00:23:45Z

Details:

Qwen3 moe model support for weight fusion compression
moe transformation: FuseVectorizedMOE3GEMM->ConvertMOEToMOECompressed->FuseMOECompressed
ov::intel_gpu::op::MOEFusedCompressed fuses softmax_topk/onehot into moe computation for performance optimization
prefill stage leverages gemm kernel to compute each experts output one by one
decode stage leverages ocl kernels to compute experts output in parallel.
moe exec graph:

Tickets:

CVS-169299

Copilot

Pull Request Overview

This PR adds GPU support for Qwen3 MoE (Mixture of Experts) models with fused compressed weight optimization. The implementation introduces a transformation pipeline that converts standard MoE operations to compressed format and fuses routing operations (softmax/topk/onehot) into the MoE computation for improved performance.

Key Changes:

New transformation passes: FuseVectorizedMOE3GEMM → ConvertMOEToMOECompressed → FuseMOECompressed
Dual execution strategy: GEMM kernels for prefill stage, OCL kernels for decode stage
Memory optimization through weight compression and operation fusion

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
transformations_pipeline.cpp	Registers new MOE transformation passes in the GPU plugin pipeline
moe_opt.cpp/hpp	Implements optimized MOE execution with oneDNN and custom OCL kernels
moe_compressed.cpp/hpp	Defines base MOECompressed operation with compressed weight configuration
moe_fused_compressed.cpp/hpp	Defines MOEFusedCompressed that includes fused routing operations
convert_moe_to_compressed.cpp/hpp	Transformation to convert standard MOE to compressed weight format
fuse_moe_compressed.cpp/hpp	Transformation to fuse routing subgraph into MOE operation
keep_moe_const_precision.cpp/hpp	Prevents precision conversion of compressed weights and zero points
moe_opt.cl, moe_mlp.cl	OpenCL kernels for softmax_topk, gather, scatter, and MLP operations
paged_attention_opt.cpp	Adds workaround for OCL resource issue with small input tokens

Comments suppressed due to low confidence (1)

src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_opt.cpp:1

Remove commented-out unused code rather than leaving it in the codebase.

// Copyright (C) 2025 Intel Corporation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_3gemm_swiglu_fuse.cl

src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_mlp.cl

src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_opt.cpp

src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_3gemm_swiglu_mlp.cl

yeonbok · 2025-11-04T17:52:22Z

src/plugins/intel_gpu/include/intel_gpu/op/moe_fused_compressed.hpp

+    ///   shape [num_experts, hidden_size, group_num, 1]
+    ///   10: w2_zp - expert zp for final projection for compressed experts,
+    ///   shape [num_experts, hidden_size, group_num, 1]
+    /// \param config Configuration for the MOE operation


This description is for 3gemm_Swiglu_type only. Please mention that

yeonbok · 2025-11-04T17:56:01Z

src/plugins/intel_gpu/src/plugin/transformations/convert_moe_to_compressed.cpp

-            config.top_k = topk_shape.back();
-            config.out_type = ov::element::dynamic;
+            auto topk_shape = pattern_map.at(topk_m).get_partial_shape();
+            OPENVINO_ASSERT(topk_shape[1].is_static(), "k dimenion in moe topk input should be static.");


Please use OPENVINO_THROW for important checking

yeonbok · 2025-11-04T18:00:33Z

src/plugins/intel_gpu/src/plugin/transformations/fuse_moe_compressed.hpp

+class FuseMOECompressed: public ov::pass::MatcherPass {
+public:
+    OPENVINO_MATCHER_PASS_RTTI("FuseMOECompressed");
+    FuseMOECompressed();


This naming is also general, but it is for Gemm3 pattern's target. Please rename this too, for reducing the confusion.

yeonbok · 2025-11-04T18:01:55Z

src/plugins/intel_gpu/tests/unit/test_cases/moe_gpu_test.cpp

+TEST(moe_compressed_gpu, moe_accuracy_test) {
+    auto& engine = get_test_engine();
+    if (!engine.get_device_info().supports_immad) {
+        std::cout << "not support immad, skip test" << std::endl;


Please remove debug print.

yeonbok · 2025-11-04T18:47:22Z

src/plugins/intel_gpu/include/intel_gpu/primitives/moe_fused_compressed.hpp

+struct moe_fused_compressed : public primitive_base<moe_fused_compressed> {
+    CLDNN_DECLARE_PRIMITIVE(moe_fused_compressed)
+
+    moe_fused_compressed() : primitive_base("", {}) {}


Please modify the primitive name too, for the specifc target pattern.

yeonbok · 2025-11-04T18:49:44Z

src/plugins/intel_gpu/src/graph/include/moe_inst.h

+namespace details {}
+
+template <>
+struct typed_program_node<moe_fused_compressed> : public typed_program_node_base<moe_fused_compressed> {


File name is too general: moe_inst.h
Pleae rename all the relevant primitives, test names, inst, node name to
moe_fused_3gemm_swiglu

yeonbok · 2025-11-04T18:56:55Z

I checked there is no impact for gpt-oss. Only minor comments added

github-actions bot added the category: GPU OpenVINO GPU plugin label Oct 24, 2025

riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch from f35b2cb to 4ccdcf1 Compare October 24, 2025 02:13

github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Oct 26, 2025

peterchen-intel marked this pull request as ready for review October 29, 2025 01:15

peterchen-intel requested review from a team as code owners October 29, 2025 01:15

peterchen-intel requested review from mryzhov and removed request for a team October 29, 2025 01:15

peterchen-intel added the do_not_merge label Oct 29, 2025

github-actions bot removed the category: transformations OpenVINO Runtime library - Transformations label Oct 29, 2025

riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch 2 times, most recently from faa6533 to 836d35c Compare October 30, 2025 04:18

riverlijunjie requested a review from a team as a code owner October 30, 2025 07:23

github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Oct 30, 2025

peterchen-intel mentioned this pull request Oct 30, 2025

[GPU]MOE to MOECompressed #32472

Closed

peterchen-intel changed the title ~~[WIP][GPU]qwen3 moe fused compressed~~ [GPU]qwen3 moe fused compressed Oct 30, 2025

peterchen-intel assigned yeonbok Oct 30, 2025

peterchen-intel added the Code Freeze label Oct 30, 2025

github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Oct 30, 2025

peterchen-intel requested a review from Copilot October 31, 2025 03:46

Copilot AI reviewed Oct 31, 2025

View reviewed changes

riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch from 0fd5af0 to 827a9f6 Compare October 31, 2025 05:26

peterchen-intel added pr: needs tests PR needs tests updating priority: high High piority and removed do_not_merge labels Oct 31, 2025

peterchen-intel mentioned this pull request Oct 31, 2025

[WIP][GPU]qwen3 moe compressed support #32473

Closed

riverlijunjie force-pushed the river/qwen3_moe_fused_compressed branch from b0b0841 to 001bed4 Compare October 31, 2025 08:56

peterchen-intel added the do_not_merge label Oct 31, 2025

qwen3 moe_compressed primitive_impl

af6463d

chenhu-wang force-pushed the river/qwen3_moe_fused_compressed branch from 358a015 to c824b67 Compare November 4, 2025 13:18

github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Nov 4, 2025

chenhu-wang force-pushed the river/qwen3_moe_fused_compressed branch from c824b67 to 79d6a13 Compare November 4, 2025 13:48

update code comment

0eb170e

github-actions bot removed category: Core OpenVINO Core (aka ngraph) category: transformations OpenVINO Runtime library - Transformations labels Nov 4, 2025

Merge branch 'master' into river/qwen3_moe_fused_compressed

6f69933

yeonbok reviewed Nov 4, 2025

View reviewed changes

update for reviewing comments: renaming and minor

eb677f5

github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Nov 5, 2025

riverlijunjie added 2 commits November 5, 2025 10:05

Switch on FuseMOE pass of offline transformation

8e00172

rename test data

8f5b815

riverlijunjie requested review from chenhu-wang, peterchen-intel, yeonbok and zhaixuejun1993 November 5, 2025 02:44

peterchen-intel approved these changes Nov 5, 2025

View reviewed changes

yeonbok approved these changes Nov 5, 2025

View reviewed changes

yeonbok enabled auto-merge November 5, 2025 03:16

chenhu-wang approved these changes Nov 5, 2025

View reviewed changes

yeonbok added this pull request to the merge queue Nov 5, 2025

moslex added this to the 2025.4 milestone Nov 5, 2025

Merged via the queue into openvinotoolkit:master with commit e61e47a Nov 5, 2025
221 of 223 checks passed

[GPU]qwen3 moe fused compressed #32536

[GPU]qwen3 moe fused compressed #32536

Uh oh!

Conversation

riverlijunjie commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Details:

Tickets:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yeonbok commented Nov 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

riverlijunjie commented Oct 24, 2025 •

edited

Loading